Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.9 - Filling in Null Values")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
   id  breed_id  nickname             birthday  age  color
0   1         1      King  2014-11-22 12:30:31    5  brown
1   2         3     Argus  2016-11-22 10:05:10   10   None
2   3         1    Chewie  2016-11-22 10:05:10   15   None
3   3         2     Maple  2018-11-22 10:05:10   17  white
4   4         2      None  2019-01-01 10:05:10   13   None

Filling in Null Values

Working with real world data, some information can be missing, but it can often be inferred from other columns or replaced with a sensible default value. These inferred or default values will then fill in those missing values. Here we will show you a few ways to do this.

Option 1 - Fill in All Missing Values with a Default Value

(
    pets
    .fillna('Unknown')
    .toPandas()
)
   id  breed_id  nickname             birthday  age    color
0   1         1      King  2014-11-22 12:30:31    5    brown
1   2         3     Argus  2016-11-22 10:05:10   10  Unknown
2   3         1    Chewie  2016-11-22 10:05:10   15  Unknown
3   3         2     Maple  2018-11-22 10:05:10   17    white
4   4         2   Unknown  2019-01-01 10:05:10   13  Unknown

What Happened?

Using fillna we attempt to fill in all Null values with the value 'Unknown'. This is fine when the Null values live in string columns, but a single string default won't help if, say, a numeric age column is missing values. We look at how to specify different values for different columns next.
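
For example, here is a minimal sketch of handling a numeric column separately (the cast is an assumption, since we read the csv without a schema and age is therefore a string):

(
    pets
    .withColumn('age', F.col('age').cast(T.IntegerType()))  # assumed cast; age is a string otherwise
    .fillna('Unknown')           # only fills string columns
    .fillna(-1, subset=['age'])  # numeric columns need a numeric default
    .toPandas()
)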

Option 2 - Fill in All Missing Values with a Mapping

(
    pets
    .fillna({
        'nickname': 'Unknown Nickname',
        'color':    'Unknown Color',
    })
    .toPandas()
)
   id  breed_id          nickname             birthday  age          color
0   1         1              King  2014-11-22 12:30:31    5          brown
1   2         3             Argus  2016-11-22 10:05:10   10  Unknown Color
2   3         1            Chewie  2016-11-22 10:05:10   15  Unknown Color
3   3         2             Maple  2018-11-22 10:05:10   17          white
4   4         2  Unknown Nickname  2019-01-01 10:05:10   13  Unknown Color

What Happened?

You have the option of filling in each column with a different value. This provides more flexibility, as the columns will usually be of different types and a single default value won't be sufficient.
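
Because the mapping is per column, the default values can also be of different types. A minimal sketch (again assuming age is cast to an integer type first):

(
    pets
    .withColumn('age', F.col('age').cast(T.IntegerType()))  # assumed cast
    .fillna({
        'nickname': 'Unknown Nickname',
        'color':    'Unknown Color',
        'age':      -1,              # a numeric default for a numeric column
    })
    .toPandas()
)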

Option 3 - coalesce()

(
    pets
    .withColumn('bogus', F.coalesce(F.col('color'), F.col('nickname'), F.lit('Default')))
    .toPandas()
)
   id  breed_id  nickname             birthday  age  color    bogus
0   1         1      King  2014-11-22 12:30:31    5  brown    brown
1   2         3     Argus  2016-11-22 10:05:10   10   None    Argus
2   3         1    Chewie  2016-11-22 10:05:10   15   None   Chewie
3   3         2     Maple  2018-11-22 10:05:10   17  white    white
4   4         2      None  2019-01-01 10:05:10   13   None  Default

What Happened?

Another way to fill in a column with values is using coalesce(). This function looks at its arguments in order from left to right and returns the first one that is not null. If all else fails, you can provide a "default" value as your last argument (remembering that it should be a columnar expression, hence F.lit()).

In our example, it will attempt to fill in the bogus column with values from the color column first; if that is null it will try the nickname column next, and if both are null it will use the default value Default.
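
coalesce() is also handy when you want to fill in an existing column in place rather than create a new one; a minimal sketch:

(
    pets
    .withColumn('color', F.coalesce(F.col('color'), F.lit('Unknown Color')))  # overwrite color with a default
    .toPandas()
)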

Summary

  • We looked at a generic way of filling in all columns with a single default value.
  • We looked at providing a mapping of {column: value} to fill in each column separately.
  • Lastly, we looked at how to fill in a column using other columns and default values in an order of precedence.
